Part 4. Regression (part b)
OLS estimator: \[(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K}) = \underset{(b_0, b_1, \ldots, b_K) \in \mathbb{R}^{K+1}}{\arg\min} \, \frac{1}{n}\sum_{i =1}^n\, \left(Y_i - (b_0 + b_1X_{[1]i} + \ldots + b_K X_{[K]i})\right)^2\]
Could solve by trying every combination of \(b_0, b_1, \ldots, b_K\).
Let’s try to understand a more efficient solution: \[\hat{\boldsymbol{\beta}} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbf{Y}\]
\[(\hat{\beta_0}, \hat{\beta_1}) = \underset{(b_0, b_1) \in \mathbb{R}^{2}}{\arg\min} \, \frac{1}{n} \sum_{i =1}^n\, \left(Y_i - (b_0 + b_1X_i)\right)^2\]
Expanding the sum, \(S\), we get
\[\begin{align} S &= \sum_{i = 1}^n \left(Y_i^2 - 2\beta_0 Y_i - 2\beta_1 X_iY_i + \beta_0^2 + 2\beta_0 \beta_1 X_i + \beta_1^2 X_i^2\right) \\ &= \sum Y_i^2 - 2 \beta_0 \sum Y_i - 2\beta_1 \sum X_i Y_i + n \beta_0^2 + 2\beta_0 \beta_1 \sum X_i + \beta_1^2 \sum X_i^2 \end{align}\]
We want to choose \(\beta_0\) and \(\beta_1\) to minimize \(S\).
How would you choose \(\beta_0\) to minimize \(S = -2\beta_0 + \beta_0^2\)?
Observing that the function is flat only at the minimum, we use differentiation to obtain the first order condition:
\[\frac{\partial S}{\partial \beta_0} = -2 + 2 \beta_0 = 0\]
So the solution is \(\beta_0 = 1\).
How would you choose \(\beta_0\) and \(\beta_1\) to minimize \(S = -2\beta_0 + \beta_0^2 - 3 \beta_1 + \beta_1^2 + \beta_0 \beta_1\)?
Taking the same approach, the first order conditions are
\[\begin{align} \frac{\partial S}{\partial \beta_0} &= -2 + 2 \beta_0 + \beta_1 = 0 \\ \frac{\partial S}{\partial \beta_1} &= -3 + \beta_0 + 2\beta_1 = 0 \end{align}\]
A system of equations!
\[\begin{align} 2 \beta_0 + \beta_1 = 2 \\ \beta_0 + 2\beta_1 = 3 \end{align}\]
By isolating and substituting, we get \(\beta_0 = 1/3\) and \(\beta_1 = 4/3\).
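This same small system can be solved numerically with base R's `solve()` (a quick check, not part of the original derivation):

```r
# coefficient matrix and right-hand side of the system
#   2*b0 +   b1 = 2
#     b0 + 2*b1 = 3
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)
d <- c(2, 3)
solve(A, d)   # b0 = 1/3, b1 = 4/3
```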
To find \(\beta_0\) and \(\beta_1\) that minimize \[S = \sum Y_i^2 - 2 \beta_0 \sum Y_i - 2\beta_1 \sum X_i Y_i + n \beta_0^2 + 2\beta_0 \beta_1 \sum X_i + \beta_1^2 \sum X_i^2,\] we use the same approach.
\[\begin{align} \frac{\partial S}{\partial \beta_0} &= -2 \sum Y_i + 2 n \beta_0 + 2\beta_1 \sum X_i &= 0\\ \frac{\partial S}{\partial \beta_1} &= -2 \sum X_i Y_i + 2 \beta_0 \sum X_i + 2 \beta_1 \sum X_i^2 &= 0 \end{align}\]
System of equations:
\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]
Isolating \(\beta_0\) in the first equation, we get \[ \beta_0 = \frac{\sum Y_i}{n} - \beta_1 \frac{\sum X_i}{n}\]
Substituting into second equation, rearranging, dividing top & bottom by \(n\), we get \[\beta_1 = \frac{\frac{\sum X_iY_i}{n} - \frac{\sum X_i}{n}\frac{\sum Y_i}{n}}{\frac{\sum X_i^2}{n} - \left(\frac{\sum X_i}{n}\right)^2}\]
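Since the \((n-1)\) denominators cancel, this formula agrees exactly with the ratio of sample covariance to sample variance. A quick check with made-up data (invented here for illustration):

```r
set.seed(2)
X <- rnorm(50)
Y <- 3 + 2 * X + rnorm(50)

# the formula derived above, written with sample means
b1 <- (mean(X * Y) - mean(X) * mean(Y)) / (mean(X^2) - mean(X)^2)

# the sample-covariance / sample-variance ratio: (n-1) denominators cancel
b1_cov <- cov(X, Y) / var(X)

c(b1, b1_cov)   # identical up to floating-point error
```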
These are the sample analogues of the bivariate BLP (where \(\beta_1 = \frac{\text{Cov}[X,Y]}{{\textrm V}\,[X]}\)).
We have shown how, with a bit of calculus, we can derive the OLS estimator in the case with one predictor.
If we want \(K > 1\) predictors, could do the same thing:
A horrible mess.
To make things easier, we turn to some linear algebra.
A matrix \(\mathbb{B}\) is a rectangular table of numbers, e.g.
\[ \mathbb{B} = \begin{bmatrix} 1 & 2 \\ 4 & 0 \\ -1 & 3 \end{bmatrix}\]
This is a \(3\times 2\) matrix (rows \(\times\) columns).
The transpose of \(\mathbb{B}\), written \(\mathbb{B}^T\) or \(\mathbb{B}'\), is
\[ \mathbb{B}^T = \begin{bmatrix} 1 & 4 & -1\\ 2 & 0 & 3 \end{bmatrix}.\]
The \(k\)th row of \(\mathbb{B}\) is the \(k\)th column of \(\mathbb{B}^T\) (and vice versa).
When we multiply a matrix \(\mathbb{B}\) with \(K\) columns by a column vector \(\mathbf{D}\) of length \(K\), we get a column vector whose length equals the number of rows of \(\mathbb{B}\):
```r
# equivalent ways to make B:
B <- cbind(c(1, 4, -1), c(2, 0, 3))
(B <- matrix(data = c(1, 4, -1, 2, 0, 3), ncol = 2))
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    4    0
#> [3,]   -1    3
```
Can only multiply \(\mathbb{B}\) and \(\mathbf{D}\) if number of columns in \(\mathbb{B}\) equals number of rows in \(\mathbf{D}\).
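For instance, multiplying the \(3\times 2\) matrix \(\mathbb{B}\) from above by a length-2 vector gives a length-3 vector (the vector `D` here is made up for illustration):

```r
B <- cbind(c(1, 4, -1), c(2, 0, 3))   # 3 x 2 matrix
D <- c(2, 1)                          # length-2 column vector

# each entry of the result is a row of B times D:
# (1*2 + 2*1, 4*2 + 0*1, -1*2 + 3*1)
B %*% D
#>      [,1]
#> [1,]    4
#> [2,]    8
#> [3,]    1
```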
Recall our system of equations:
\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]
We can write this in matrix form as
\[ \mathbb{B} \boldsymbol{\beta} = \mathbf{D} \]
where
\[\mathbb{B} = \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix}, \, \, \, \,\, \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \, \, \, \,\, \mathbf{D} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}\]
The inverse of a scalar (number) \(a\) is written \(a^{-1}\), and \(a \times a^{-1} = 1\). (e.g. \(3^{-1} = 1/3\), so \(3 \times 1/3 = 1\).)
We multiply both sides by the inverse of \(a\) when we do
\[\begin{align} a x &= b \\ x &= b/a \end{align}\]
The inverse of a matrix \(\mathbb{B}\) is written \(\mathbb{B}^{-1}\).
We can similarly solve our problem using the inverse of \(\mathbb{B}\):
\[\begin{align} \mathbb{B} \boldsymbol{\beta} &= \mathbf{D} \\ \boldsymbol{\beta} &= \mathbb{B}^{-1} \mathbf{D} \end{align}\]
This is a useful way to represent the solution to a system of equations.
There are manual procedures for inverting matrices. We use R’s solve() function:
Recall our toy dataset, and the system of equations \(\mathbb{B} \beta = \mathbf{D}\) that represents the first order conditions for the \(\beta\) vector that minimizes the sum of squared residuals:
Using solve(), we compute \(\hat{\beta} = \mathbb{B}^{-1}\mathbf{D}\) to solve the system of equations:
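Since the toy dataset is not reproduced on this slide, here is a sketch with made-up data: build \(\mathbb{B}\) and \(\mathbf{D}\) from the sums in the first order conditions, then solve:

```r
set.seed(3)
X <- rnorm(20)
Y <- 1 + 2 * X + rnorm(20)   # simulated data, for illustration only
n <- length(X)

# B and D from the first order conditions
B <- matrix(c(n,      sum(X),
              sum(X), sum(X^2)), nrow = 2, byrow = TRUE)
D <- c(sum(Y), sum(X * Y))

(beta_hat <- solve(B) %*% D)   # same coefficients as lm(Y ~ X)
```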
So we have (1) a way to represent systems of equations and (2) a way to solve them.
The remaining bottleneck: getting the system of equations from the data (expand sum of squared residuals, take derivative, etc).
Let \(\mathbb{X}\) be the \(n \times (K + 1)\) regressor matrix (a column of 1s and one column per predictor):
\[\mathbb{X} = \begin{bmatrix} 1 & X_{[1]1} & X_{[2]1} & \cdots & X_{[K]1} \\ 1 & X_{[1]2} & X_{[2]2} & \cdots& X_{[K]2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{[1]n} & X_{[2]n} & \cdots & X_{[K]n} \end{bmatrix}\]
Let \(\boldsymbol{\beta}\) be the \(K+1\)-length column vector of coefficients: \[\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{K} \end{bmatrix}\]
Let \(\mathbf{Y}\) be the \(n\)-length column vector of outcomes: \[\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{n} \end{bmatrix}\]
For a given coefficient vector \(\boldsymbol{\beta}\), the vector of residuals is \(\mathbf{Y} - \mathbb{X}\boldsymbol{\beta}\), so the sum of squared residuals is \((\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})\).
OLS estimator in matrix form:
\[\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta} \in \mathbb{R}^{K+1}}{\arg\min} \, (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta}) \]
Now we do minimization via differentiation for this version of the problem.
Expand the sum:
\[ (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta}) = \mathbf{Y}^T \mathbf{Y} - \mathbf{Y}^T \mathbb{X} \boldsymbol{\beta} - \boldsymbol{\beta}^T \mathbb{X}^T \mathbf{Y} + \boldsymbol{\beta}^T\mathbb{X}^T \mathbb{X} \boldsymbol{\beta}\]
Differentiating with respect to \(\boldsymbol{\beta}\) gives us the first order condition(s), \[2 \mathbb{X}^T \mathbb{X}\boldsymbol{\beta} - 2 \mathbb{X}^T \mathbf{Y} = 0,\]
which becomes the system of equations \[\mathbb{X}^T \mathbb{X}\boldsymbol{\beta} = \mathbb{X}^T \mathbf{Y} \]
with solution \[ \boldsymbol{\beta} = \left(\mathbb{X}^T \mathbb{X}\right)^{-1} \mathbb{X}^T \mathbf{Y}.\]
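This closed form is easy to check against R's `lm()` on simulated data (the dataset below is invented for illustration):

```r
set.seed(1)
n <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
Y <- 1 + 2 * X1 - 0.5 * X2 + rnorm(n)   # simulated outcomes

Xmat <- cbind(1, X1, X2)   # column of 1s plus one column per predictor

# beta_hat = (X^T X)^{-1} X^T Y
beta_hat <- solve(t(Xmat) %*% Xmat) %*% t(Xmat) %*% Y
drop(beta_hat)             # matches coef(lm(Y ~ X1 + X2))
```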
Our system of equations was:
\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]
In matrix form:
\[\begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}\]
This is \(\mathbb{X}^T \mathbb{X} \boldsymbol{\beta} = \mathbb{X}^T \mathbf{Y}\):
\[\begin{align}\mathbb{X}^T \mathbb{X} &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}\\ &= \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2\end{bmatrix} \end{align}\]
\[\begin{align}\mathbb{X}^T \mathbf{Y} &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \\ &= \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix} \end{align}\]
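These identities can be verified numerically with any made-up \(X\) and \(Y\):

```r
set.seed(4)
X <- rnorm(10)
Y <- rnorm(10)          # arbitrary data, for illustration only
n <- length(X)
Xmat <- cbind(1, X)     # regressor matrix for one predictor

# X^T X reproduces the matrix of sums from the system of equations
XtX <- unname(t(Xmat) %*% Xmat)
all.equal(XtX, matrix(c(n, sum(X), sum(X), sum(X^2)), nrow = 2))  # TRUE

# X^T Y reproduces the right-hand-side vector of sums
XtY <- unname(drop(t(Xmat) %*% Y))
all.equal(XtY, c(sum(Y), sum(X * Y)))                             # TRUE
```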
Most important to know about OLS:
Today we have added some intuition about how OLS is estimated:
For a walk-through of the optimization part of matrix OLS, see Ben Lambert’s econometrics videos:
Also see his video on the variance of the OLS estimator given homoskedasticity.
Our goal in this class was to “understand topics every social scientist needs to know”:
Review list of (mis)conceptions in Lecture 0!
Quant methods is about clarifying what can be learned from observing and counting things in the world.
This can be surprisingly difficult! But we hope you
Remember: often data collection is most important, and with great data we usually don’t need fancy methods.